Discovering findings that replicate from a primary study 
of high dimension to a follow-up study 

Marina Bogomolov and Ruth Heller



Technion and Tel-Aviv University 

Abstract. We consider the problem of identifying whether findings replicate from one
study of high dimension to another, when the primary study guides the selection of
hypotheses to be examined in the follow-up study as well as when there is no division of roles
into the primary and the follow-up study. We show that existing meta-analysis methods
are not appropriate for this problem, and suggest novel methods instead. We prove that
our multiple testing procedures control for appropriate error-rates. For FWER control,
the only requirement is the independence of p-values across studies. For FDR control, we
prove that if the p-values within each study are PRDS dependent or independent, the FDR
of our novel procedures is controlled. We demonstrate the usefulness of these procedures
via simulations and examples.

Keywords: False discovery rate; genome-wide association studies; meta-analysis; multiple
comparisons; replicability analysis



Address for correspondence: Department of Statistics and Operations Research, Tel-Aviv
University, Tel-Aviv, Israel. This work was supported by grant no. 2012896 from the Israel Science
Foundation (ISF). E-mail: ruheller@post.tau.ac.il



1 Introduction 



In genomics research, it is customary that a primary study is followed by an independent
study. Reporting results from the primary study, and then reporting the
evidence from the follow-up study that supports these results, gives a sense of the
replicability of the results. For example, findings are informally regarded as replicated
if the p-value for testing a null hypothesis is small in the primary study, and then for
the same hypothesis the p-value is fairly small in the follow-up study.

Many approaches are available for analyzing two or more studies, where the follow-up
studies simply serve to add power. See Hedges and Olkin (1985), Benjamini and Yekutieli
(2005), Skol et al. (2006), and Zeggini et al. (2007), among others. In this work, we
focus on analyzing two studies, where the follow-up study serves to confirm the findings
that were identified in the primary study. A formal statistical approach is proposed
for evaluating whether results from a primary study were indeed replicated in
a follow-up study.

In observational studies, an association may fail to replicate because the discovered
association was not the actual effect of a treatment but rather that of bias (Rosenbaum,
2001). However, if the finding is replicated in a different cohort, using different
diagnostic or laboratory methods, then the association between effect and outcome may
be more convincingly causal. Rosenbaum (2001) gives the example of radiation and
leukemia. Suppose higher rates of leukemia are discovered in a primary study among
radiologists, and in a follow-up study among survivors at Hiroshima and Nagasaki.
Radiation is more convincingly causal if the association discovered was replicated in
the follow-up study, since if radiation was not a cause of leukemia, then higher rates
of leukemia among radiologists would not lead us to expect higher rates of leukemia
among survivors at Hiroshima and Nagasaki. Another example comes from the field
of genomic research. Genome-wide association (GWA) studies are observational studies,
and therefore there is always a danger that bias may explain away the discoveries.
Kraft et al. (2009) note that for common variants, the anticipated effects are modest
and very similar in magnitude to the subtle biases that may affect genetic association
studies, most notably population stratification bias. For this reason, they argue that
it is important to see the association in other studies conducted using a similar, but
not identical, study base.

It is common practice that interesting findings in a primary GWA study are investigated
in another study, and the interesting results of both studies are reported
(Lander and Kruglyak, 1995). For example, to discover association between single-nucleotide
polymorphisms (SNPs) and hippocampal volume, Bis et al. (2012) tested
$2.5 \times 10^{6}$ SNPs in a primary study, and only a handful of SNPs in promising loci in a
follow-up study. Bis et al. (2012) considered a SNP in the first study to be forwarded
for replication if the SNP p-value in the primary study was below $4 \times 10^{-7}$, corresponding
to one expected false positive if all SNPs are not associated with hippocampal
volume. They viewed the SNP as containing evidence of replication if its p-value in the
follow-up study was below 0.01, which is the Bonferroni threshold when 5 hypotheses
are simultaneously tested at the 0.05 family-wise error rate (FWER). Their approach
selects hypotheses for follow-up based on suggestive evidence (Lander and Kruglyak,
1995), and corrects for multiplicity only in the second study when discussing evidence
of replicability. Another naive approach is the following: apply a multiple testing
procedure within each study separately, and declare as replicated the common findings.
This approach will lead to declaring SNPs that were found to be associated with the
disease in the first study as well as in the follow-up study as the discoveries of interest.
If there was no danger that a multiple testing procedure produces false positives, then
this naive approach would have been appropriate. However, multiple testing procedures
have a non-zero probability of producing false positives, unless they have no
power. Therefore, an approach that provides control over false positives in each study
separately does not guarantee control over false positives for evaluating whether the
results were replicated. Figure 3, left panel, shows that the FDR level can be as high
as one when naively declaring results as replicated if they were discovered by applying
an FDR controlling procedure at the nominal 0.05 level separately in each study.
Moreover, reducing the nominal 0.05 level does not resolve the problem, see Remark 3.1.

The paper is organized as follows. Section 2 gives the notation and review. Section 3
suggests novel multiple testing procedures for replicability analysis, when the primary
study guides the selection of hypotheses to be examined in a follow-up study. Section
4 considers the setting where there is no division of roles into a primary and a follow-up
study. In Section 5 we revisit the example of Bis et al. (2012), and show that
our suggested formal approach to replicability concurs with their main findings. In
Section 5 we also show additional examples from the GWAS simulator HAPGEN2
(Su et al., 2011). Section 6 describes a simulation study, and Section 7 gives some
final remarks.



2 Notation, Goal, and Review

Consider a family of $m$ elementary null hypotheses $H_1, \ldots, H_m$. These elementary
null hypotheses, or a subset thereof, are tested in each of two independent studies.
Let $h_{ij}$ be the indicator of whether $H_j$ is false in study $i$. The pair of indicators
$(h_{1j}, h_{2j})$ identifies 4 possible settings for each $j$:

$$(h_{1j}, h_{2j}) = \begin{cases}
(0, 0) & \text{if } H_j \text{ is true in both studies,} \\
(1, 0) & \text{if } H_j \text{ is false in the primary study but true in the follow-up study,} \\
(0, 1) & \text{if } H_j \text{ is true in the primary study but false in the follow-up study,} \\
(1, 1) & \text{if } H_j \text{ is false in both studies.}
\end{cases}$$

The set of indices $\{1, \ldots, m\}$ of the elementary null hypotheses may be divided into
four (unknown) subsets $I_{00} \cup I_{10} \cup I_{01} \cup I_{11} = \{1, \ldots, m\}$, where each index $j$ is in
exactly one of the four subsets, defined as follows:
$I_{00} = \{j : (h_{1j}, h_{2j}) = (0, 0),\ j \in \{1, \ldots, m\}\}$;
$I_{10} = \{j : (h_{1j}, h_{2j}) = (1, 0),\ j \in \{1, \ldots, m\}\}$;
$I_{01} = \{j : (h_{1j}, h_{2j}) = (0, 1),\ j \in \{1, \ldots, m\}\}$;
$I_{11} = \{j : (h_{1j}, h_{2j}) = (1, 1),\ j \in \{1, \ldots, m\}\}$.

Definition 2.1. The no replicability null hypothesis for elementary hypothesis $H_j$ is

$$H_{NR,j} : (h_{1j}, h_{2j}) \in \{(0, 0), (0, 1), (1, 0)\}.$$

By definition, $H_{NR,j}$ is false if and only if the elementary null hypothesis $H_j$ is
false in both studies considered. In the family of $m$ composite null hypotheses
$H_{NR,1}, \ldots, H_{NR,m}$, the sets of indices of true and false null hypotheses are
$I_{00} \cup I_{01} \cup I_{10}$ and $I_{11}$ respectively. Our goal is to discover as many indices from
$I_{11}$ as possible, i.e. true positives, while controlling for the number of discoveries from
$I_{00} \cup I_{01} \cup I_{10}$, i.e. false positives.

Let the p-values for study $i$ be $p_i = (p_{i1}, \ldots, p_{im})$, for $i = 1, 2$. Since the studies are
independent, the p-values are independent across studies. However, the p-values within
each study may be dependent. An inequality between vectors, such as $x \le y$, is understood
componentwise.

Remark 2.1. In a typical meta-analysis (Hedges and Olkin, 1985), the goal is to
discover as many indices from $I_{01} \cup I_{10} \cup I_{11}$ as possible while controlling for the number of
discoveries from $I_{00}$. Had we known, and had it been true, that $I_{01} = \emptyset$ and $I_{10} = \emptyset$,
then the same methods for meta-analysis may serve to discover replicable findings.
However, it is not known in practice whether $I_{01}$ and $I_{10}$ are empty sets, and they
need not be empty when the follow-up study is different, in at least one aspect of
design, from the primary study. Therefore, typical meta-analysis methods are not
appropriate when the aim is to discover hypotheses with indices in $I_{11}$, treating all
discoveries from $I_{01}$ and $I_{10}$, in addition to $I_{00}$, as false discoveries.



2.1 The partial conjunction approach 



In Benjamini et al. (2009) the partial conjunction approach (Benjamini and Heller,
2008) has been suggested for replicability analysis when $n \ge 2$ studies are available
that examine the same problem. When exactly two studies are available, the procedure
in Benjamini et al. (2009) amounts to applying the Benjamini-Hochberg false
discovery rate (FDR) controlling procedure (Benjamini and Hochberg, 1995), henceforth
referred to as the BH procedure, on the maximum of the two study p-values.
However, this procedure may be too conservative, making it practically very difficult
to discover false no replicability null hypotheses.

As an example, suppose there is an original GWA study that examines the association
of $10^{6}$ SNPs with a phenotype. Now suppose 200 promising SNPs were selected to
be examined in a follow-up study. If a SNP has a p-value of $0.025/10^{6}$ in the first
study, and of 0.025/200 in the second study, then the maximum p-value is 0.025/200.
The BH procedure will, most probably, not reject the no replicability null hypothesis
for a SNP with maximum p-value of 0.025/200, since this maximum p-value is not
strong enough evidence when faced with $10^{6}$ hypotheses, out of which most of the
hypotheses are true no replicability null hypotheses. The alternative procedures we
suggest in Sections 3 and 4 will view the evidence from this SNP as strong enough
for it to be considered a replicated finding.
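The arithmetic behind this example can be checked with a minimal sketch. It assumes a nominal FDR level of 0.05 and $m = 10^{6}$ hypotheses (both are assumptions made here for illustration), and the helper name `min_rejections_needed` is hypothetical. It computes how many hypotheses the BH procedure would have to reject in total before a maximum p-value of 0.025/200 falls under the BH step-up threshold $rq/m$.

```python
from fractions import Fraction
import math


def min_rejections_needed(p_value, m, q):
    """Smallest rank r with p_value <= r * q / m.

    BH rejects a p-value only if at least this many hypotheses
    are rejected in total. Exact fractions avoid rounding issues.
    """
    return math.ceil(Fraction(p_value) * m / Fraction(q))


m = 10**6                          # assumed number of SNPs in the primary study
q = "0.05"                         # assumed nominal FDR level
p_max = Fraction("0.025") / 200    # maximum of the two study p-values

needed = min_rejections_needed(p_max, m, q)
print(needed)
```

Under these assumptions the required number of rejections is far larger than the 200 SNPs that were followed up, which is one way to see why the BH procedure on maximum p-values is unlikely to reject this hypothesis.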



3 Replicability analysis with a primary and a follow-up study

For the family of $m$ no replicability null hypotheses $H_{NR,1}, \ldots, H_{NR,m}$, we consider
two relevant error measures: the probability that at least one no replicability null
hypothesis was falsely rejected, that is the FWER, and the expected fraction of false
rejections out of all rejections of no replicability null hypotheses, that is the FDR.

Procedure 3.1. The two stage FWER controlling procedure for testing the family of
no replicability null hypotheses with parameters $(\alpha_1, \alpha)$, where $0 < \alpha_1 < \alpha < 1$:

1. Let $\mathcal{R}_1$ be the set of indices of elementary hypotheses that are selected for testing
in a follow-up study based on the data from the primary study.

2. Apply a FWER controlling procedure at level $\alpha_1$, using the data from the primary
study only, on the family of null hypotheses $H_1, \ldots, H_m$, and let
$\mathcal{R}_p \subseteq \{1, \ldots, m\}$ be the set of indices of rejected hypotheses. Apply a FWER
controlling procedure at level $\alpha - \alpha_1$, using the data from the follow-up study only,
on the family of selected null hypotheses $\{H_j : j \in \mathcal{R}_1\}$, and let $\mathcal{R}_f \subseteq \mathcal{R}_1$ be
the set of indices of rejected hypotheses. Then the set of indices of rejected no
replicability null hypotheses is $\mathcal{R}_f \cap \mathcal{R}_p$.

Theorem 3.1. For two independent studies, Procedure 3.1 controls the FWER at
level $\alpha$ for the family of no replicability null hypotheses $H_{NR,1}, \ldots, H_{NR,m}$.

Proof. Let $V_p = \sum_{j \in \mathcal{R}_p} (1 - h_{1j})$ and $V_f = \sum_{j \in \mathcal{R}_f} (1 - h_{2j})$ be the number of true
elementary null hypotheses rejected in the primary study and in the follow-up study,
respectively, by Procedure 3.1. Then

$$FWER \le E(I[V_p + V_f > 0]) \le E(I[V_p > 0]) + E(E(I[V_f > 0] \mid p_1)) \le \alpha_1 + \alpha - \alpha_1 = \alpha,$$

where the last inequality follows from the fact that $V_f$ is independent of the data
from the primary study, and that in both stages a FWER controlling procedure is
applied. $\square$



Using Bonferroni in Procedure 3.1 amounts to rejecting $H_{NR,j}$ if
$(p_{1j}, p_{2j}) \le (\alpha_1/m, (\alpha - \alpha_1)/|\mathcal{R}_1|)$, for $j \in \mathcal{R}_1$. Alternatively, the results can be
reported in terms of Bonferroni-replicability adjusted p-values for fixed $c = \alpha_1/\alpha$:

$$p_j^{Bonf\text{-}REPadj} = \max\left(\frac{m p_{1j}}{c}, \frac{|\mathcal{R}_1| p_{2j}}{1 - c}\right), \quad j \in \mathcal{R}_1.$$

Procedure 3.1 using Bonferroni is equivalent to rejecting all hypotheses with
Bonferroni-replicability adjusted p-values at most $\alpha$.
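As a minimal sketch of the Bonferroni version of Procedure 3.1, the following code applies the two thresholds and also computes the Bonferroni-replicability adjusted p-values. The function name, the data layout (a list of primary p-values and a dictionary of follow-up p-values for the selected indices) and the toy numbers are assumptions made for illustration. Capping the adjusted values at one is a convention added here, not something stated in the text.

```python
def bonferroni_replicability(p1, p2, selected, alpha1, alpha):
    """Two stage Bonferroni procedure for no replicability null hypotheses.

    p1       : primary study p-values for all m hypotheses (list)
    p2       : follow-up p-values, keyed by selected index (dict)
    selected : indices chosen for follow-up from the primary study
    """
    m = len(p1)
    r1 = len(selected)
    c = alpha1 / alpha
    rejected = sorted(
        j for j in selected
        if p1[j] <= alpha1 / m and p2[j] <= (alpha - alpha1) / r1
    )
    # Adjusted p-values: reject exactly when the adjusted value is <= alpha.
    adjusted = {
        j: min(1.0, max(m * p1[j] / c, r1 * p2[j] / (1 - c)))  # cap at 1 (assumed convention)
        for j in selected
    }
    return rejected, adjusted


# Toy data (hypothetical): 4 hypotheses, the first two followed up.
p1 = [0.001, 0.004, 0.2, 0.5]
p2 = {0: 0.005, 1: 0.03}
rejected, adjusted = bonferroni_replicability(p1, p2, [0, 1], alpha1=0.025, alpha=0.05)
print(rejected, adjusted)
```

In this toy example the two routes agree: the hypotheses rejected by the thresholds are exactly those whose adjusted p-value is at most $\alpha = 0.05$.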

In many modern applications, controlling the FWER is unnecessary and results in
overly conservative inferences. In genomics research, it is often enough to guarantee
FDR control, see Storey and Tibshirani (2003) and Reiner et al. (2003), among
others.

Procedure 3.2. The two stage FDR controlling procedure for testing a family of no
replicability null hypotheses with parameters $(q_1, q)$, where $0 < q_1 < q < 1$:

1. Let $\mathcal{R}_1$ be the set of indices of elementary hypotheses that are selected for testing
in a follow-up study based on the data from the primary study. Let $R_1 = |\mathcal{R}_1|$
be the cardinality of this set.

2. Let

$$R_2 = \max\left\{ r : \sum_{j \in \mathcal{R}_1} I\left[(p_{1j}, p_{2j}) \le \left(\frac{r q_1}{m}, \frac{r (q - q_1)}{R_1}\right)\right] = r \right\}.$$

Then the set of indices of rejected no replicability null hypotheses is

$$\mathcal{R}_2 = \left\{ j : (p_{1j}, p_{2j}) \le \left(\frac{R_2 q_1}{m}, \frac{R_2 (q - q_1)}{R_1}\right),\ j \in \mathcal{R}_1 \right\}.$$



The results of Procedure 3.2 can be reported in terms of FDR-replicability adjusted
p-values. Let $c = q_1/q$,

$$Z_j = \max\left(\frac{m p_{1j}}{c}, \frac{R_1 p_{2j}}{1 - c}\right), \quad j \in \mathcal{R}_1, \qquad (3.1)$$

and let $Z_{(1)} \le \ldots \le Z_{(R_1)}$ be the sorted Z-values. Then the $i$th smallest FDR-replicability
adjusted p-value is

$$p_{(i)}^{REPadj} = \min_{j \ge i} \frac{Z_{(j)}}{j}, \quad i = 1, \ldots, R_1. \qquad (3.2)$$

Procedure 3.2 with parameters $(q_1, q) = (cq, q)$ is equivalent to rejecting all no
replicability null hypotheses with FDR-replicability adjusted p-values at most $q$.
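The following minimal sketch implements step 2 of Procedure 3.2 and the FDR-replicability adjusted p-values of (3.1) and (3.2) for a given selected set. The function names, the data layout and the toy p-values are assumptions made for illustration, and the adjusted values are capped at one, which is a convention assumed here.

```python
def procedure_3_2(p1, p2, selected, q1, q):
    """Step 2 of the two stage FDR procedure, given the selected indices."""
    m, r1 = len(p1), len(selected)
    for r in range(r1, 0, -1):  # search for the largest r whose count equals r
        hits = [j for j in selected
                if p1[j] <= r * q1 / m and p2[j] <= r * (q - q1) / r1]
        if len(hits) == r:
            return sorted(hits)
    return []  # no r qualifies, so nothing is rejected


def fdr_replicability_adjusted(p1, p2, selected, c):
    """FDR-replicability adjusted p-values for c = q1 / q."""
    m, r1 = len(p1), len(selected)
    z = {j: max(m * p1[j] / c, r1 * p2[j] / (1 - c)) for j in selected}
    order = sorted(selected, key=lambda j: z[j])  # ascending Z-values
    adjusted, running = {}, float("inf")
    for rank in range(r1, 0, -1):  # running minimum of Z_(j) / j over j >= rank
        j = order[rank - 1]
        running = min(running, z[j] / rank)
        adjusted[j] = min(running, 1.0)  # cap at 1 (assumed convention)
    return adjusted


# Toy data (hypothetical): m = 10 hypotheses, three selected for follow-up.
p1 = [0.0001, 0.004, 0.006, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
p2 = {0: 0.001, 1: 0.01, 2: 0.5}
selected = [0, 1, 2]
rejected = procedure_3_2(p1, p2, selected, q1=0.025, q=0.05)
adjusted = fdr_replicability_adjusted(p1, p2, selected, c=0.5)
print(rejected, adjusted)
```

In this toy example the set returned by `procedure_3_2` coincides with the set of hypotheses whose adjusted p-value is at most $q = 0.05$.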



Definition 3.1. A valid selection rule for step 1 of Procedure 3.2 satisfies the following
condition: for any $j \in \mathcal{R}_1$, fixing all the p-values except for $p_{1j}$ and changing $p_{1j}$
so that $H_j$ is still selected, will not change the set $\mathcal{R}_1$.



It is easy to see that this condition is satisfied if $\mathcal{R}_1$ contains the hypotheses with
the $k$ smallest p-values for a fixed number $k$, all hypotheses with p-value below a given
threshold, or if $\mathcal{R}_1$ contains the rejected indices from a BH procedure on the p-values from the
primary study. Adaptive FDR procedures on the p-values from the primary study, e.g.
Benjamini and Hochberg (2000), Storey et al. (2004), Benjamini et al. (2006), and
Blanchard and Roquain (2009), are non-valid selection rules.



Theorem 3.2. If all the p-values are jointly independent and the selection rule in
step 1 of Procedure 3.2 is a valid selection rule, then Procedure 3.2 controls the FDR
at level $q$ for the family of no replicability null hypotheses $H_{NR,1}, \ldots, H_{NR,m}$.

See Appendix A for the proof.

The selection rule affects the power of Procedure 3.2. A natural choice for a selection
rule is the set of rejected hypotheses by a BH procedure at level $q_1$ on the primary
study p-values, since the set of indices of rejected no replicability null hypotheses
is a subset of this set. A rule that selects by the BH procedure at level $q$ is not
as good as the rule at level $q_1$, since any additional hypotheses selected will not be
rejected but will result in a more severe threshold on the follow-up study p-values.
We observed in simulations that the BH procedure at level $q_1$ is very close to selecting
the optimal number of hypotheses for follow-up, and therefore we recommend using
it when there are no additional constraints that require choosing only $k$ (a small
number) of hypotheses for follow-up.
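A minimal sketch of the recommended selection rule is given below: the BH step-up procedure applied to the primary study p-values. The function name and the toy p-values are hypothetical. Running it at levels $q_1$ and $q$ shows that the larger level can only add hypotheses to the selected set.

```python
def bh_rejections(pvalues, level):
    """Indices rejected by the BH step-up procedure at the given level."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= rank * level / m:
            k = rank  # keep the largest rank that meets its threshold
    return sorted(order[:k])


primary = [0.001, 0.02, 0.03, 0.5]          # toy primary study p-values
selected_q1 = bh_rejections(primary, 0.025)  # selection at level q1
selected_q = bh_rejections(primary, 0.05)    # selection at level q
print(selected_q1, selected_q)
```

Here selection at level $q$ adds hypotheses that were not selected at level $q_1$, which enlarges $R_1$ and therefore tightens the follow-up threshold $r(q - q_1)/R_1$.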

The choice of $q_1$ affects the power of Procedure 3.2. We observed in simulations
(Section 6) that to maximize power, if the p-values of the false no replicability null
hypotheses tend to be smaller for the primary study, then choosing $q_1/q < 0.5$ may
be best, but if the p-values have the same distribution in both primary and follow-up
study, then choosing $q_1/q > 0.5$ is best. The optimal choice of $q_1$ depends on $m$, $R_1$,
and the distribution functions of the p-values for false no replicability null hypotheses,
and therefore guidelines for choosing $q_1$ are application specific. In Section 5 we
discuss the choice of $q_1$ for GWAS.

Theorem 3.2 assumes independence of the p-values within each study as well as across
the studies. However, the assumption of independence among the p-values within
each study may not be realistic in many applications. Particularly, in GWA studies
there is dependency across the SNPs, therefore the p-values within each study may be
dependent. Benjamini and Yekutieli (2001) proved that the BH procedure controls
the FDR when the p-values have a special dependency called PRDS.



Definition 3.2. (Benjamini and Yekutieli, 2001) The set of p-values $P_1, \ldots, P_m$ has
property PRDS if for any increasing set $D$, and for each true null hypothesis $i$,
$\Pr((P_1, \ldots, P_m) \in D \mid P_i = p)$ is nondecreasing in $p$.



In order to extend the result in Theorem 3.2 to this type of dependency, we need
additional constraints on the selection rule. We consider the general case that hypotheses
may be grouped to, say, $L$ groups, and let $g_j \in \{1, \ldots, L\}$ be the group label
of hypothesis $j$, for $j = 1, \ldots, m$. Note that $L = m$ if the hypotheses are not grouped.
In GWA studies, SNPs may be grouped into pathways or sets that belong to genes.

Definition 3.3. For hypotheses with grouping $g_1, \ldots, g_m$, let $i_l$ be the index of the
smallest primary study p-value for the hypotheses in group $l$, $l \in \{1, \ldots, L\}$. A k-valid
selection rule for step 1 of Procedure 3.2 is a valid selection rule which selects the $k$
hypotheses corresponding to the $k$ smallest p-values among $\{p_{1 i_1}, \ldots, p_{1 i_L}\}$.

If the hypotheses are not grouped, then the k-valid selection rule amounts to selecting
the hypotheses that correspond to the $k$ smallest p-values from the primary study. In
GWAS, grouping at the gene level may be reasonable, and the most promising SNP
from each of the $k$ most promising genes is a k-valid selection rule.
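A minimal sketch of a k-valid selection rule follows. It first takes the hypothesis with the smallest primary study p-value in each group and then keeps the $k$ smallest among these group representatives. The function name, the gene labels and the p-values are hypothetical.

```python
def k_valid_selection(p1, groups, k):
    """Select the k most promising group representatives.

    p1     : primary study p-values
    groups : group label of each hypothesis (same length as p1)
    k      : number of hypotheses to follow up, fixed before the analysis
    """
    best = {}  # group label -> index of its smallest p-value
    for j, g in enumerate(groups):
        if g not in best or p1[j] < p1[best[g]]:
            best[g] = j
    representatives = sorted(best.values(), key=lambda j: p1[j])
    return sorted(representatives[:k])


p1 = [0.01, 0.001, 0.2, 0.03, 0.0005, 0.4]
genes = ["A", "A", "B", "B", "C", "C"]          # toy grouping of SNPs into genes
grouped = k_valid_selection(p1, genes, k=3)      # one SNP per gene
ungrouped = k_valid_selection(p1, list(range(len(p1))), k=3)  # L = m
print(grouped, ungrouped)
```

With every hypothesis in its own group the rule reduces to picking the $k$ smallest primary study p-values, while with gene-level grouping at most one SNP per gene is forwarded.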

Theorem 3.3. Assume the set of p-values has property PRDS within each of the
studies, and that the p-values across studies are independent. If the selection rule in
step 1 of Procedure 3.2 is a k-valid selection rule for a fixed $k$ defined prior to analysis,
then Procedure 3.2 controls the FDR at level $q$ for the family of no replicability null
hypotheses $H_{NR,1}, \ldots, H_{NR,m}$.

The proof is given in Appendix 






Remark 3.1. Benjamini and Yekutieli (2005) proved in their Proposition 3 that the
procedure that applies the BH procedure at level $q_1$ on the primary study p-values,
and the BH procedure at level $q - q_1$ on the follow-up study p-values, controls the
FDR at level $q_1 (q - q_1) < q$ on the family of global null hypotheses, $H_{G1}, \ldots, H_{Gm}$,
where $H_{Gj} : (h_{1j}, h_{2j}) = (0, 0)$. However, on the family of no replicability hypotheses,
$H_{NR,1}, \ldots, H_{NR,m}$, the FDR of this procedure may be higher than the nominal
level $q$. The key difference between Procedure 3.2 and such a naive procedure is the
requirement that the two p-values from a selected hypothesis have to simultaneously
be smaller than two thresholds. Therefore, in an extreme scenario where all hypotheses
are from $I_{10}$ or $I_{01}$, and the p-values from true non-null hypotheses are zero, the
naive procedure may have an FDR of one as follows. The BH procedure on the primary
study p-values will reject all hypotheses from $I_{10}$ but also few from $I_{01}$ (when
$|I_{01}|$ and $|I_{10}|$ are large enough), and the hypotheses from $I_{01}$ will be rejected by the
BH procedure on the follow-up study p-values. However, Procedure 3.2 will have an
FDR level below $q$. To see this, note that in order to reject a no replicability null
hypothesis by Procedure 3.2, the p-value of the Simes test (Simes, 1986) for the
intersection of the elementary hypotheses indexed by $I_{01}$, using the data from the primary
study, has to be below $q_1$, or the Simes test p-value for the intersection of elementary
hypotheses indexed by $I_{10} \cap \mathcal{R}_1$, using the data from the follow-up study, has to be
below $q - q_1$. Therefore, the probability of rejecting at least one no replicability null
hypothesis, which coincides with the FDR since all no replicability null hypotheses are
true, is at most $q$. See Figure 3, right panel, for a more realistic simulated example.
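The extreme scenario of this remark can be illustrated with a small deterministic sketch. It is a single constructed configuration, not an estimate of the FDR: the 1000 hypotheses, the levels $(q_1, q) = (0.025, 0.05)$, the function names, and the evenly spaced grid that stands in for uniform null p-values are all assumptions made for illustration. Every hypothesis is in $I_{10}$ or $I_{01}$, so any rejection is a false discovery.

```python
def bh(pvals, level):
    """Positions (within pvals) rejected by the BH step-up procedure."""
    n = len(pvals)
    order = sorted(range(n), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * level / n:
            k = rank
    return set(order[:k])


def procedure_3_2(p1, p2, selected, q1, q):
    """Step 2 of Procedure 3.2 for a given selected set."""
    m, r1 = len(p1), len(selected)
    for r in range(r1, 0, -1):
        hits = [j for j in selected
                if p1[j] <= r * q1 / m and p2[j] <= r * (q - q1) / r1]
        if len(hits) == r:
            return set(hits)
    return set()


half = 500
grid = [(k + 0.5) / half for k in range(half)]  # stand-in for uniform null p-values
p1 = [0.0] * half + grid   # indices 0..499 are in I_10, 500..999 in I_01
p2 = grid + [0.0] * half
q1, q = 0.025, 0.05

# Naive procedure: BH at q1 on study one, then BH at q - q1 on the survivors.
stage1 = sorted(bh(p1, q1))
stage2 = bh([p2[j] for j in stage1], q - q1)
naive = {stage1[pos] for pos in stage2}

# Procedure 3.2 with the same selected set.
proposed = procedure_3_2(p1, p2, stage1, q1, q)
print(len(naive), len(proposed))
```

In this configuration the naive procedure rejects several hypotheses, all of them from $I_{01}$, so its false discovery proportion is one, whereas Procedure 3.2 rejects nothing.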






4 Replicability analysis with no division into primary and follow-up studies

Consider now a situation where both studies are available before the analysis. If
some of the elementary hypotheses are examined in only one of the studies, then
these hypotheses are not considered for replicability analysis. In this setting, there
is no primary study and follow-up study. We propose the following generalization of
Procedure 3.2 that can be tuned to treat the two studies symmetrically. Without
loss of generality, we label the studies as study one and study two.

Procedure 4.1. The generalized two stage procedure for testing a family of no
replicability null hypotheses with parameters $(w_1, q_1, q)$, where $0 \le w_1 \le 1$ and
$0 < q_1 < q < 1$:

1. Apply Procedure 3.2 with parameters $(w_1 q_1, w_1 q)$ with study one as the primary
study and study two as the follow-up study. Denote the set of indices of rejected
no replicability null hypotheses by $\mathcal{R}_{12, w_1 q}$.

2. Reverse the roles of study one and study two. Apply Procedure 3.2 with parameters
$((1 - w_1) q_1, (1 - w_1) q)$. Denote the set of indices of rejected no replicability
null hypotheses by $\mathcal{R}_{21, (1 - w_1) q}$.

3. The set of indices of rejected no replicability null hypotheses is
$\mathcal{R}_{12, w_1 q} \cup \mathcal{R}_{21, (1 - w_1) q}$.

Theorem 4.1. Assume that the p-values across studies are independent. Then Procedure
4.1 controls the FDR at level $q$ for the family of no replicability null hypotheses
$H_{NR,1}, \ldots, H_{NR,m}$ in either one of the following situations:

1. The set of all the p-values within each study are independent, and the selection
rule in step 1 of Procedure 3.2 is a valid selection rule.

2. The set of all the p-values within each study has property PRDS, and the selection
rule in step 1 of Procedure 3.2 is the k-valid selection rule, with $k$ fixed
prior to analysis.

See Appendix D for the proof.

Choosing $w_1 = 1$ results in Procedure 3.2 where study one has the role of the primary
study and study two has the role of the follow-up study. Similarly, choosing $w_1 = 0$
results in Procedure 3.2 with the roles of study one and study two reversed. The
choice $0 < w_1 < 1$ reflects the similarity of Procedure 4.1 to Procedure 3.2 in the
following way: when $w_1$ is close to one (zero), Procedure 4.1 gives similar results to
Procedure 3.2 with study one (two) as the primary study. The choice $w_1 = 0.5$ results
in a variant of Procedure 3.2 that is symmetric with respect to both studies.
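A minimal sketch of Procedure 4.1 is given below. It assumes that the selection rule in each direction is the BH procedure at level $w_1 q_1$ or $(1 - w_1) q_1$ on the study that plays the primary role, and it skips a direction whose weight is zero. The function names and the toy p-values are hypothetical.

```python
def bh_select(pvals, level):
    """Indices rejected by the BH step-up procedure at the given level."""
    n = len(pvals)
    order = sorted(range(n), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * level / n:
            k = rank
    return sorted(order[:k])


def procedure_3_2(p_primary, p_followup, selected, q1, q):
    """Step 2 of Procedure 3.2 for a given selected set."""
    m, r1 = len(p_primary), len(selected)
    for r in range(r1, 0, -1):
        hits = [j for j in selected
                if p_primary[j] <= r * q1 / m
                and p_followup[j] <= r * (q - q1) / r1]
        if len(hits) == r:
            return set(hits)
    return set()


def procedure_4_1(p_one, p_two, w1, q1, q):
    """Union of the two directed analyses, weighted by w1 and 1 - w1."""
    rejected = set()
    if w1 > 0:  # study one as primary, study two as follow-up
        sel = bh_select(p_one, w1 * q1)
        rejected |= procedure_3_2(p_one, p_two, sel, w1 * q1, w1 * q)
    if w1 < 1:  # roles reversed
        sel = bh_select(p_two, (1 - w1) * q1)
        rejected |= procedure_3_2(p_two, p_one, sel, (1 - w1) * q1, (1 - w1) * q)
    return sorted(rejected)


p_one = [0.0001, 0.0002, 0.6, 0.9]   # toy p-values, study one
p_two = [0.0003, 0.7, 0.0001, 0.8]   # toy p-values, study two
symmetric = procedure_4_1(p_one, p_two, w1=0.5, q1=0.025, q=0.05)
one_primary = procedure_4_1(p_one, p_two, w1=1.0, q1=0.025, q=0.05)
print(symmetric, one_primary)
```

Only the first hypothesis has small p-values in both studies, and it is the only one reported for either choice of $w_1$ in this toy example.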



5 GWA studies examples 



We reproduce in Table 1, columns 1-4, a subset of the columns of Table 1 of results
of Bis et al. (2012). We added in columns 5-7 the FDR-replicability adjusted p-values
for $q_1/q \in \{0.2, 0.5, 0.8\}$. Procedure 3.2 with parameters $(q_1, q) = (0.025, 0.05)$
identified the SNP near MSRB3 as having replicated association with the phenotype,
and Procedure 3.2 with parameters $(q_1, q) = (0.04, 0.05)$ identified the SNPs near
MSRB3, WIF1 and HRK as having replicated association with the phenotype. These
results concur with the main conclusions of Bis et al. (2012). A heuristic justification
for preferring $c = q_1/q = 0.8$ over $c = q_1/q < 0.5$ when the primary study selects
only a handful of SNPs for follow-up is the following. Since the adjusted p-values in
(3.2) require adjustment of the primary study p-values by the factor $m/c$, and of the
follow-up study p-values by $R_1/(1 - c)$, a reasonable guess is to choose $c$ to make
these factors equal, if the distribution function of the p-values in the primary and
follow-up studies is assumed to be similar for false no replicability null hypotheses.
By this heuristic, $c = m/(m + R_1) > 0.5$ is the preferred choice. In this example, the
sample size in the follow-up study was smaller than the sample size in the primary
study, therefore $c = 0.8$ is more reasonable than $c = 2.5 \times 10^{6}/(2.5 \times 10^{6} + 5)$.

Table 1: The p-values of SNPs from the primary and follow-up studies, from Table
1 of Bis et al. (2012) (columns 3-4), and the FDR-replicability adjusted p-values for
various choices of $c = q_1/q$ (columns 5-7).

Locus    Gene     Primary study    Follow-up study    FDR-replicability adjusted p-values
                                                      c = 0.2     c = 0.5     c = 0.8
2q24     DPP4     5.2 × 10^-?      0.7                0.8750      1.0000      1.0000
9q33     ASTN2    1.0 × 10^-?      0.2                0.3125      0.5000      1.0000
12q14    MSRB3    5.5 × 10^-9      0.002              0.0688      0.0275      0.0344
         WIF1     2.2 × 10^-8      0.0007             0.1375      0.0550      0.0344
12q24    HRK      4.8 × 10^-8      5.8 × 10^-?        0.2000      0.0800      0.0500
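The equal-factor heuristic can be checked numerically with a minimal sketch. It assumes $m = 2.5 \times 10^{6}$ primary study hypotheses and $R_1 = 5$ followed-up SNPs, and the function name is hypothetical. It computes the value of $c$ that equalizes $m/c$ and $R_1/(1 - c)$ and compares the two adjustment factors with those obtained for $c = 0.8$.

```python
def equalizing_c(m, r1):
    """Value of c = q1 / q that makes m / c equal to r1 / (1 - c)."""
    return m / (m + r1)


m, r1 = 2_500_000, 5   # assumed sizes of the primary family and the selected set

c_equal = equalizing_c(m, r1)
factors_equal = (m / c_equal, r1 / (1 - c_equal))  # both close to m + r1
factors_08 = (m / 0.8, r1 / (1 - 0.8))             # factors for c = 0.8
print(c_equal, factors_equal, factors_08)
```

With so few followed-up SNPs the equalizing value of $c$ is very close to one, so almost all of the level would be spent on the primary study. This is the value against which the text weighs the choice $c = 0.8$.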

As another example, we simulated two GWA studies from the simulator HAPGEN2
(Su et al., 2011). The two studies were generated from two samples of the HapMap
project (The International HapMap Consortium, 2003): a sample of 165 Utah residents
with Northern and Western European ancestry (CEU), and a sample of 109
Chinese in Metropolitan Denver, Colorado (CHD). In the CEU and CHD populations,
respectively, 34 and 38 SNPs were set as causal with an increased multiplicative
relative risk of 1.2. The two populations had 18 causal SNPs in common. In order to
identify the SNPs in each study where the null hypothesis is false, the simulation
of 4500 cases and 4500 controls from the population was repeated 10 times, and 10
p-values were produced per SNP. SNPs with Fisher's combined p-value (Loughin,
2004) below the Bonferroni threshold were considered to be truly associated with
the disease. Our ground truth included 1355 and 1010 SNPs associated with the
disease in the CEU and in the CHD population, respectively, out of which 274 SNPs
were associated with the disease in both populations. The simulated studies retained
the linkage disequilibrium across SNPs as measured for the samples in the HapMap
project.



We generated from the CEU and CHD populations 11 pairs of datasets, in which
4500 cases and 4500 referents were sampled from each population. As a standard
preprocessing step, we removed SNPs with minor allele frequency below 0.05, and
thus the number of SNPs in the analysis was reduced from 1,387,466 to 887,362, on
the average. Table 2 presents the average number of replicated findings, as well as
the average false discovery proportion (FDP) for the different methods. The standard
error (SE) is presented in parentheses. From the last column we see that the average
FDP was below 0.05 using all procedures. From rows 1 and 2 we see that if there
is no division into primary and follow-up studies, then the symmetric Procedure 4.1
discovers more SNPs with replicated associations than the BH procedure on maximum
p-values, while maintaining a low FDP. From rows 3-5, and 6-8, we see that the choice
of which study was the primary study had a large effect on the average number of
discoveries, and the choice of $q_1$ mattered little.

Table 2: For 4500 cases and 4500 referents in both studies, the average number of
associated and causal SNPs discovered (SE), and the average FDP (SE), for different
procedures. The selection rule for Procedure 4.1 was the BH procedure at level $w_1 q_1$
when study one was the primary study, and at level $(1 - w_1) q_1$ when study two was
the primary study.

Procedure                                           # Replicated findings                        FDP (SE)
                                                    associated SNPs (SE)   causal SNPs (SE)
BH on maximum p-values                              29.182 (3.205)         7.364 (0.432)         0.000 (0.000)
Procedure 4.1 with w1 = 0.5, q1 = 0.025, q = 0.05   77.727 (6.378)         11.455 (0.366)        0.011 (0.005)
Procedure 4.1 with w1 = 1, q1 = 0.01, q = 0.05      74.091 (6.748)         10.364 (0.310)        0.012 (0.006)
Procedure 4.1 with w1 = 1, q1 = 0.025, q = 0.05     76.091 (6.221)         10.727 (0.359)        0.012 (0.005)
Procedure 4.1 with w1 = 1, q1 = 0.04, q = 0.05      69.545 (5.745)         10.818 (0.352)        0.009 (0.005)
Procedure 4.1 with w1 = 0, q1 = 0.01, q = 0.05      35.545 (4.575)         7.364 (0.607)         0.008 (0.008)
Procedure 4.1 with w1 = 0, q1 = 0.025, q = 0.05     41.455 (5.294)         8.273 (0.469)         0.007 (0.007)
Procedure 4.1 with w1 = 0, q1 = 0.04, q = 0.05      42.273 (4.158)         8.545 (0.312)         0.000 (0.000)



6 A simulation study 



The goal of the simulations was threefold. First, to investigate the effect of the
choice of $q_1$ and $w_1$ on the power of Procedures 3.2 and 4.1. Second, to compare
these procedures to the alternative of applying BH on the maximum p-values, i.e.
the partial conjunction approach when exactly two studies are analyzed. Third, to
investigate the effect of the selection rule on the power of the procedures.

The procedures compared were (1) the BH procedure at level 0.05 on maximum
p-values; (2) Procedure 4.1 with $w_1 \in \{0, 0.5, 1\}$, $c = q_1/q \in \{0.1, 0.2, \ldots, 0.9\}$, and
$q = 0.05$; (3) the naive (BH-i, BH-j) procedure, $i, j \in \{1, 2\}$, $i \ne j$, which applies
the BH procedure at level 0.05 on the p-values of study $i$, and separately on the
p-values of study $j$ for the hypotheses that were rejected at study $i$, and declares
hypotheses rejected in both studies as false no replicability null hypotheses; (4) the
oracle BH procedure on maximum two study p-values at level 0.05, which applies the
BH procedure on maximum p-values at level x0.05; (5) the oracle Procedure 4.1
with parameters $(q_1, q) = (q', 2q')$, where $q'$ was the solution to ^-^{q')'^ + {^-^ + l)q' =
0.05. This oracle procedure controls the FDR at level 0.05, see Appendix B for a proof.

The p-values were generated independently as follows. For $H_j$, $j = 1, \ldots, m$,
$p_{1j} = 1 - \Phi(X_{1j}/\sigma_1)$ and $p_{2j} = 1 - \Phi(X_{2j}/\sigma_2)$, where $X_{1j} \sim N(\mu_{1j}, \sigma_1^2)$ and
$X_{2j} \sim N(\mu_{2j}, \sigma_2^2)$. We let $\mu_{ij} = 0 \cdot (1 - h_{ij}) + \mu_i \cdot h_{ij}$, where $i \in \{1, 2\}$, and
$\mu_i \in \{0.5, 1, \ldots, 5\}$. We set $m = 1000$, and $f_{ij} = |I_{ij}|/m$ for $i, j \in \{0, 1\}$ as follows:
$f_{00} = 0.9$, $f_{11} = 0.1$; $f_{00} = 0.9$, $f_{01} = f_{10} = 0.025$, $f_{11} = 0.05$; $f_{01} = f_{10} = 0.5$;
$f_{00} = 0.8$, $f_{01} = f_{10} = 0.1$. The standard deviations $\sigma_1$ and $\sigma_2$ were either fixed values
$\sigma_i \in \{0.3, 1\}$, $i \in \{1, 2\}$, or reflected the fraction of sample size allocated to the first
study: $\sigma_1 = \sigma/\sqrt{\zeta N}$, $\sigma_2 = \sigma/\sqrt{(1 - \zeta) N}$, $\sigma = 10$, $\zeta \in \{0.1, 0.2, \ldots, 0.9\}$, $N = 1000$.
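A minimal sketch of this data-generating scheme is given below, using only the Python standard library. It assumes that each p-value is the one-sided normal tail probability of the observation divided by its standard deviation, and it uses one of the configurations listed above with a common alternative mean in both studies. The function name and the seed are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_pvalues(h1, h2, mu, sigma1, sigma2, seed=0):
    """Draw one pair of p-value vectors for the two studies.

    h1, h2 : 0/1 indicators of false elementary nulls in study one and two
    mu     : common mean under the alternative (zero under the null)
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal distribution function
    p1, p2 = [], []
    for a, b in zip(h1, h2):
        x1 = rng.gauss(mu * a, sigma1)
        x2 = rng.gauss(mu * b, sigma2)
        p1.append(1 - phi(x1 / sigma1))  # assumed standardization by sigma
        p2.append(1 - phi(x2 / sigma2))
    return p1, p2


# m = 1000 with f00 = 0.9, f10 = f01 = 0.025, f11 = 0.05.
h1 = [0] * 900 + [1] * 25 + [0] * 25 + [1] * 50
h2 = [0] * 900 + [0] * 25 + [1] * 25 + [1] * 50
p1, p2 = simulate_pvalues(h1, h2, mu=3.0, sigma1=0.3, sigma2=1.0)
print(min(p1), min(p2))
```

With $\sigma_1 = 0.3$ and $\sigma_2 = 1$ the non-null p-values of study one are much smaller than those of study two, which is the asymmetry that the choice of the primary study can exploit.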

The simulation results were based on 1000 repetitions. The FDR was estimated
by averaging the FDP. The average power was estimated by the average number of
rejected false no replicability null hypotheses, divided by $m f_{11}$.
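The estimation of FDP and power for procedure (1), BH on the maximum p-values, can be sketched as follows. This is an illustrative implementation of the standard BH step-up rule; the helper names `benjamini_hochberg`, `bh_on_max` and `fdp_and_power` are hypothetical, and the toy p-values are made up for the example.

```python
def benjamini_hochberg(p_values, q):
    """Indices rejected by the BH step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * q / m:
            k = rank  # keep the largest rank that satisfies the BH condition
    return set(order[:k])


def bh_on_max(p1, p2, q):
    """BH applied to the per-hypothesis maximum of the two study p-values."""
    return benjamini_hochberg([max(a, b) for a, b in zip(p1, p2)], q)


def fdp_and_power(rejected, replicated):
    """replicated[j] is True when the no replicability null of hypothesis j is false."""
    false_rejections = sum(1 for j in rejected if not replicated[j])
    true_rejections = len(rejected) - false_rejections
    fdp = false_rejections / max(len(rejected), 1)
    n_replicated = sum(replicated)
    power = true_rejections / n_replicated if n_replicated else 0.0
    return fdp, power


# Toy data: only hypothesis 0 is non-null in both studies.
toy_p1 = [0.001, 0.004, 0.5, 0.03]
toy_p2 = [0.0005, 0.008, 0.001, 0.9]
toy_rejected = bh_on_max(toy_p1, toy_p2, q=0.05)
toy_fdp, toy_power = fdp_and_power(toy_rejected, [True, False, False, False])
```

Averaging `fdp` and `power` over repeated data sets gives the estimated FDR and average power described above.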

6.1 Simulation results 

As expected from our theoretical results, in all the settings considered the estimated 
FDR was below 0.05 for all procedures but the naive (BH-i, BH-j) procedure. The 
SE of the estimated FDR and power were of the order of 10~^ for all procedures under 
all configurations considered. 

The oracle Procedure 4.1 with $w_1 = 1$ dominated, in terms of power, the oracle BH
procedure on maximum p-values and the oracle Procedure 4.1 with $w_1 \in \{0, 0.5\}$.
Figure 1 compares procedures (1) and (2) above with the most powerful oracle, in a
configuration with parameters $\sigma_1 = 0.3$, $\sigma_2 = 1$, $f_{00} = 0.9$, $f_{01} = f_{10} = 0.025$, $f_{11} =
0.05$. For each procedure the estimated power and FDR is shown as a function of the
common expectation under the alternative, $\mu = \mu_1 = \mu_2$. Procedure 4.1 with $w_1 = 1$
is more powerful than with $w_1 = 0.5$ or $w_1 = 0$, while the choice $w_1 = 0$ is the worst
in terms of power of Procedure 4.1. Moreover, Procedure 4.1 with $w_1 \in \{0.5, 1\}$ is
more powerful than the BH procedure on maximum p-values. These findings were
consistent across all configurations of $f_{00}, f_{10}, f_{01}, f_{11}$ examined, when $\sigma_1 = 0.3$ and
$\sigma_2 = 1$. Since the oracle Procedure 4.1 with $w_1 = 1$ and the BH procedure on
maximum p-values do not depend on $q_1$, their power curves are the same in figures
(a), (b), and (c). We see that Procedure 4.1 with $w_1 = 1$ is a close second to the oracle
when $q_1$ is 0.01 but is farther from the oracle as $q_1$ increases. Similarly, the power of
Procedure 4.1 with $w_1 = 0.5$ decreases as $q_1$ increases. However, Procedure 4.1 with
$w_1 = 0$ has largest power for $q_1 = 0.04$, and the least power for $q_1 = 0.01$. These
results are reasonable since the p-values of study one tend to be much smaller than
the p-values of study two when the no replicability null hypotheses are false. In Table
3 we see that if the p-value distribution of false no replicability null hypotheses is the
same across studies, then the optimal choice of $q_1$ is $q_1 > q/2$. For example, when
$\mu = \mu_1 = \mu_2 = 2$ (row 2), the power is 0.65 with $q_1 = 0.005$, 0.77 with $q_1 = 0.045$,
and the maximum power is 0.81 with $q_1 = 0.035$.

Table 3: The power of Procedure 3.2 with the BH selection rule at level 0.05c, for
different values of $\mu = \mu_1 = \mu_2$, with $\sigma_1 = \sigma_2 = 0.5$, $f_{00} = 0.9$, $f_{01} = f_{10} = 0.025$, $f_{11} =
0.05$. The optimal value of c is in bold.

                                  c = q_1/0.05
  mu      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
  1.5    0.143   0.195   0.224   0.245   0.257   0.258   0.248   0.226   0.181
  2.0    0.646   0.718   0.755   0.778   0.794   0.803   0.805   0.800   0.769
  2.5    0.934   0.955   0.965   0.971   0.975   0.977   0.978   0.978   0.974


Figure 2 compares the procedures (1) and (2) above for the same configuration of
$f_{ij}$, but for fixed $\mu = \mu_1 = \mu_2$ and varying sample size of the two studies. The
varying power is described by the fraction $\zeta$ of sample allocated to the first study.
For the symmetric procedures, we see that for $\zeta = 0.1$ the power is the lowest, and
it increases to reach its maximum for equal allocation, $\zeta = 0.5$. Procedure 4.1 with
$w_1 = 0.5$ dominates the BH procedure on the maximum two study p-values. For
Procedure 4.1 with $w_1 = 1$, the maximum is reached for $\zeta > 0.5$. It is the most
powerful of the three procedures examined for $\zeta > 0.6$.

In Figure 3 we consider the FDR level of Procedure 4.1 with $w_1 \in \{0, 0.5, 1\}$, as well as
of the naive procedure. The estimated FDR of (BH-i, BH-j) procedure exceeds 0.05
in the settings where $f_{10} = f_{01} = 0.5$ and $f_{00} = 0.8$, $f_{10} = f_{01} = 0.1$. In these settings
the estimated FDR of both (BH-1, BH-2) and (BH-2, BH-1) procedures are increasing
functions of $\mu = \mu_1 = \mu_2$, reaching one in the setting where $f_{10} = f_{01} = 0.5$ (left), and
0.4 in the setting where $f_{00} = 0.8$, $f_{10} = f_{01} = 0.1$ (right). Clearly, procedure (BH-i,
BH-j) is not valid since it may be far too liberal in terms of FDR level.
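For concreteness, the naive (BH-i, BH-j) procedure can be sketched as below. This is a minimal illustration of the two-step rule as described in the list of compared procedures; the function names and the toy p-values are hypothetical.

```python
def bh_reject(p_values, q):
    """Positions rejected by the BH step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * q / m:
            k = rank
    return set(order[:k])


def naive_two_step(p_first, p_second, q=0.05):
    """BH on the first study, then BH on the second study among the survivors."""
    survivors = sorted(bh_reject(p_first, q))
    # The second BH run treats the survivors as a fresh family of hypotheses,
    # ignoring that they were selected using the first study.
    local = bh_reject([p_second[j] for j in survivors], q)
    return {survivors[t] for t in local}


# Toy example with four hypotheses.
naive_rejected = naive_two_step([0.001, 0.002, 0.9, 0.8], [0.5, 0.01, 0.001, 0.0001])
```

One way to read the FDR inflation in the setting $f_{10} = f_{01} = 0.5$: every hypothesis is a true no replicability null, so any rejection is false. The few hypotheses from $I_{01}$ that pass the first step by chance carry strong signal in the second study and are then rejected, which pushes the FDP to one.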



Figure 1: Power as a function of $\mu = \mu_1 = \mu_2$, for $q_1$ of (a) 0.01, (b) 0.025, and (c)
0.04, using the following procedures: the oracle Procedure 4.1 (solid with circles); the
BH procedure at level 0.05 applied on maximum p-values (dash-dotted); Procedure
4.1 at level 0.05 with $w_1 = 0$ (dashed), $w_1 = 0.5$ (dotted), and $w_1 = 1$ (solid), where
the selection rule at steps 1 and 2 is the BH procedure at levels $w_1 q_1$ and $(1 - w_1) q_1$,
respectively. The remaining parameters were $f_{00} = 0.9$, $f_{01} = 0.025$, $f_{10} = 0.025$, $f_{11} =
0.05$, $\sigma_1 = 0.3$ and $\sigma_2 = 1$.

Finally, we examined how the selection rule affects the power. In Figure 4 we show the
power as a function of $\mu_1$ for Procedure 4.1 with parameters $w_1 = 0.5$, $q_1 = 0.025$, $q =
0.05$, for the following selection rules: BH at level 0.0125; the rule that selects the
hypotheses with k smallest primary study p-values, where $k \in \{25, 30, \dots, 100\}$. The
remaining parameters were: $f_{00} = 0.9$, $f_{01} = f_{10} = 0.025$, $f_{11} = 0.05$, $\sigma_1 = 0.5$, $\sigma_2 =
1$, $\mu_2 = 3$. For different values of $\mu_1$ the optimal k is different, and using the BH
procedure for selection is optimal for the entire range of $\mu_1$.



Figure 2: Power as a function of fraction $\zeta$ of sample size allocated to the primary
study, for (a) $\mu_1 = \mu_2 = 2$, and (b) $\mu_1 = \mu_2 = 3$, for Procedure 4.1 with $w_1 = 1$
(solid), with $w_1 = 0.5$ (dotted), and of the BH procedure on the maximum of two
studies p-values (dash-dotted) at level $q = 0.05$. The remaining parameters were
$f_{00} = 0.9$, $f_{01} = 0.025$, $f_{10} = 0.025$, $f_{11} = 0.05$, sample size $N = 1000$, standard
deviation $\sigma = 10$.

Figure 3: FDR versus $\mu = \mu_1 = \mu_2$ for $f_{01} = f_{10} = 0.5$ (left) and $f_{00} = 0.8$, $f_{01} =
f_{10} = 0.1$ (right), for the following procedures at level $q = 0.05$: BH-1, BH-2 (solid
with circles); BH-2, BH-1 (dashed with circles); Procedure 4.1 with $q_1 = 0.025$ and
$w_1$ of 1 (solid), 0.5 (dotted), or 0 (dashed). The standard deviations were $\sigma_1 = 0.3$
and $\sigma_2 = 1$.

Figure 4: Power as a function of $\mu_1$ for Procedure 4.1 with parameters $w_1 = 0.5$, $q_1 =
0.025$, $q = 0.05$ for the following selection rules: BH at level 0.0125 (solid black
curve); selection of the hypotheses with k smallest primary study p-values, where
$k = 25$ (solid green curve), $k = 75$ (solid red curve), $k \in \{30, 35, \dots, 100\}$ (dashed
grey curves). The remaining parameters were: $f_{00} = 0.9$, $f_{01} = f_{10} = 0.025$, $f_{11} =
0.05$, $\sigma_1 = 0.5$, $\sigma_2 = 1$, $\mu_2 = 3$.



7 Discussion

In many research areas first a primary study is analyzed, then a follow-up study is
analyzed with the goal to corroborate the findings, or at least a subset of the findings,
of the primary study. We suggested novel testing procedures for corroborating the
evidence from a primary study in a follow-up study. We demonstrated their usefulness
on a GWAS application. In the setting where there is no division of roles to a primary
and a follow-up study, the simulations suggested that our novel Procedure 4.1 with
$w_1 = 0.5$ is more powerful than the BH procedure on maximum p-values.



We proved that Procedures 3.2 and 4.1 control the FDR when the p-values are
independent within each study and the selection rule is valid, as well as when p-values
within each independent study have property PRDS and the selection rule is the
k-valid selection rule. We can further show, using straightforward extensions of the
proofs, that Procedure 3.2 controls the FDR on the family of no replicability null
hypotheses for any valid selection rule if: (1) the p-values in the first study are
independent and the p-values in the second study have property PRDS; (2) if instead of
$q_1$ we use the threshold $q_1/(\sum_{i=1}^{m} 1/i)$, the p-values in the first study have any type of
dependency, and the p-values in the second study have PRDS dependency. Extensive
simulations demonstrated that the BH procedure controls the FDR for many types
of dependence encountered in practice (Yekutieli, 2008). We conjecture that this
robustness property carries over to our procedures. For simulated GWAS examples the
average false discovery proportion was below the nominal FDR level, suggesting that
the procedure is indeed valid for the type of dependency that occurs in GWAS.
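To give a sense of how conservative the dependency-adjusted threshold $q_1/(\sum_{i=1}^{m} 1/i)$ in case (2) is, the short sketch below evaluates it. The values $m = 1000$ and $q_1 = 0.025$ are borrowed from the simulation settings purely for illustration, and the function names are hypothetical.

```python
def harmonic_sum(m: int) -> float:
    """Sum of 1/i for i = 1, ..., m."""
    return sum(1.0 / i for i in range(1, m + 1))


def dependency_adjusted_level(q1: float, m: int) -> float:
    """Threshold q1 / (1 + 1/2 + ... + 1/m) used in place of q1."""
    return q1 / harmonic_sum(m)


# Illustrative values only: m = 1000 hypotheses and q1 = 0.025.
adjusted_q1 = dependency_adjusted_level(0.025, 1000)
```

For $m = 1000$ the harmonic sum is about 7.49, so the adjusted level is roughly 7.5 times smaller than $q_1$.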

Replicability analysis, as suggested in this paper, requires that the investigators make
several key design choices in addition to the error level q: the selection rule, $q_1$, and
$w_1$ if two studies are available without division into primary and follow-up. The
power of the procedure for replicability analysis varies with these choices. From our
investigations, it appears reasonable in Procedure 3.2 to select hypotheses by BH at
level $q_1$, and to set $w_1 = 0.5$ in Procedure 4.1 if the p-value distributions may be
assumed to be similar in both studies for false no replicability null hypotheses. We
gave some guidelines for choosing $q_1$ in specific settings, and more general guidelines
are a topic for future research.
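To make the role of the design choices concrete, the following sketch shows how a selection set, $q_1$ and $q$ could enter the rejection step. It is a reconstruction under an assumption, not the authors' code: it assumes that Procedure 3.2 rejects the selected hypotheses with $(p_{1j}, p_{2j}) \le (R q_1/m, R q_2/|\mathcal{R}_1|)$ for the largest feasible $R$, where $q_2 = q - q_1$, as suggested by the events used in Appendix A. The name `procedure_3_2_sketch` is hypothetical.

```python
def procedure_3_2_sketch(p1, p2, selected, q1, q):
    """Step-up rejection among `selected` indices (assumed form of Procedure 3.2)."""
    m = len(p1)
    q2 = q - q1
    sel = sorted(selected)
    n_sel = len(sel)
    r_star = 0
    for r in range(1, n_sel + 1):
        # Count selected hypotheses passing both thresholds at candidate size r.
        passing = sum(
            1 for j in sel if p1[j] <= r * q1 / m and p2[j] <= r * q2 / n_sel
        )
        if passing >= r:
            r_star = r  # keep the largest r supported by at least r hypotheses
    return {
        j for j in sel
        if p1[j] <= r_star * q1 / m and p2[j] <= r_star * q2 / n_sel
    }


# Toy example: hypotheses 0, 1 and 3 were selected from the primary study.
sketch_rejected = procedure_3_2_sketch(
    p1=[0.001, 0.004, 0.5, 0.03],
    p2=[0.0005, 0.008, 0.001, 0.9],
    selected={0, 1, 3},
    q1=0.025,
    q=0.05,
)
```

The selection set is passed in as an argument so that any selection rule, for example BH at level $q_1$ on the primary study p-values, can be combined with the same rejection step.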

If estimation of $|I_{00}|$ and $|I_{01}|$ is possible with reasonable accuracy, or these terms can
be bounded from below, then incorporating these estimates instead of the true values
used for the oracle Procedure 4.1 may lead to substantial increase in power. We leave
this estimation problem to future research as well.

References 

Benjamini, Y. and Heller, R. (2008). Screening for partial conjunction hypotheses.
Biometrics, 64:1215-1222.

Benjamini, Y., Heller, R., and Yekutieli, D. (2009). Selective inference in complex
research. Philosophical Transactions of the Royal Society A (accepted), 267:1-17.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate - a
practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B Met.,
57(1):289-300.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery
rate in multiple testing with independent statistics. Journal of Educational and
Behavioral Statistics, 25(1):60-83.

Benjamini, Y., Krieger, M., and Yekutieli, D. (2006). Adaptive linear step-up false
discovery rate controlling procedures. Biometrika, 93(3):491-507.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in
multiple testing under dependency. The Annals of Statistics, 29(4):1165-1188.

Benjamini, Y. and Yekutieli, D. (2005). Quantitative trait loci analysis using the false
discovery rate. Genetics, 171:783-790.

Bis et al. (2012). Common variants at 12q14 and 12q24 are associated with
hippocampal volume. Nature Genetics, doi:10.1038/ng.2237.

Blanchard, G. and Roquain, E. (2009). Adaptive false discovery rate control under
independence and dependence. Journal of Machine Learning Research, 10:2837-2871.

Hedges, L. and Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic
Press, London.

Kraft, P., Zeggini, E., and Ioannidis, J. (2009). Replication in genome-wide
association studies. Statistical Science, 24(4):561-573.

Lander, E. and Kruglyak, L. (1995). Genetic dissection of complex traits: guidelines
for interpreting and reporting linkage results. Nature Genetics, 11:241-247.

Loughin, T. (2004). A systematic comparison of methods for combining p-values from
independent tests. Computational Statistics and Data Analysis, 47:467-485.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003). Identifying differentially
expressed genes using false discovery rate controlling procedures. Bioinformatics,
19(3):368-375.

Rosenbaum, P. (2001). Replicating effects and biases. The American Statistician,
55(3):223-227.

Simes, R. (1986). An improved Bonferroni procedure for multiple tests of significance.
Biometrika, 73(3):751-754.

Skol, A., Scott, L., Abecasis, G., and Boehnke, M. (2006). Joint analysis is more
efficient than replication-based analysis for two-stage genome-wide association studies.
Nature Genetics, 38:209-213.

Storey, J., Taylor, J., and Siegmund, D. (2004). Strong control, conservative point
estimation, and simultaneous conservative consistency of false discovery rates: A
unified approach. Journal of the Royal Statistical Society, Series B, 66:187-205.

Storey, J. and Tibshirani, R. (2003). Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences, 100(16):9440-9445.

Su, Z., Marchini, J., and Donnelly, P. (2011). Hapgen2: simulation of multiple disease
SNPs. Bioinformatics, 27(16):2304-2305.

The International HapMap Consortium (2003). The International HapMap Project.
Nature, 426:789-796.

Yekutieli, D. (2008). Comments on: Control of the false discovery rate under
dependence using the bootstrap and subsampling. Test, 17(3):458-460.

Zeggini, E., Weedon, M., Lindgren, C., Frayling, T., Elliott, K., Lango, H., Timpson,
N., Perry, J., and Rayner, N. (2007). Replication of genome-wide association signals
in UK samples reveals risk loci for type 2 diabetes. Science, 316:1336-1341.



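A Proof of Theorem 3.2



Let $q_2 = q - q_1$, and for each $j \in \{1, \dots, m\}$, let $P_1^{(j)}$ and $P_2^{(j)}$ denote the vectors
$P_1 = (P_{11}, P_{12}, \dots, P_{1m})$ and $P_2 = (P_{21}, P_{22}, \dots, P_{2m})$ with, respectively, $P_{1j}$ and $P_{2j}$
excluded. For $j \in \{1, \dots, m\}$ arbitrary fixed, let $\mathcal{R}_1^{(j)}(P_1^{(j)}) \subseteq \{1, \dots, j-1, j+1, \dots, m\}$
be the subset of indices selected along with index j. Note that since the
selection rule is valid, this subset is fixed as long as $P_{1j}$ is such that j is selected based
on $(P_1^{(j)}, P_{1j})$. For any $j \in \{1, \dots, m\}$ and given $P_1^{(j)}$, for $i \in \{1, \dots, j-1, j+1, \dots, m\}$
we define

$$T_i^{(j)} = \begin{cases} \max\left( \dfrac{m P_{1i}}{q_1},\ \dfrac{\left(|\mathcal{R}_1^{(j)}(P_1^{(j)})| + 1\right) P_{2i}}{q_2} \right) & \text{if } i \in \mathcal{R}_1^{(j)}(P_1^{(j)}), \\ \infty & \text{otherwise.} \end{cases}$$

Let $T_{(1)} \le \dots \le T_{(m-1)}$ be the sorted T-values, and $T_{(0)} = 0$. For $r = 1, \dots, m$, we
define $C_r^{(j)}$ as the event in which if $H_{NR,j}$ is rejected by Procedure 3.2, r hypotheses
are rejected including $H_{NR,j}$:

$$C_r^{(j)} = \left\{ (P_1^{(j)}, P_2^{(j)}) : T_{(r-1)} \le r,\ T_{(r)} > r + 1,\ T_{(r+1)} > r + 2,\ \dots,\ T_{(m-1)} > m \right\}.$$

Obviously, $C_r^{(j)}$ and $C_{r'}^{(j)}$ are disjoint events for any $r \neq r'$, and $\cup_{r=1}^{m} C_r^{(j)}$ is the
entire space of $(P_1^{(j)}, P_2^{(j)})$. Let $I_0 = I_{01} \cup I_{00}$, $R_j$ be the indicator of whether $H_{NR,j}$
was rejected for $j = 1, \dots, m$, and $R = \sum_{j=1}^{m} R_j$. The FDR for the family of no
replicability null hypotheses is

$$E\left( \sum_{j \in I_0} \frac{R_j}{\max(R, 1)} \right) + E\left( \sum_{j \in I_{10}} \frac{R_j}{\max(R, 1)} \right). \tag{A.1}$$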



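First, we find an upper bound for the first term of the sum in (A.1).

$$\begin{aligned}
E\left( \sum_{j \in I_0} \frac{R_j}{\max(R, 1)} \right)
&= \sum_{j \in I_0} \sum_{r=1}^{m} \frac{1}{r} \Pr\left( j \in \mathcal{R}_1,\ P_{1j} \le \frac{r q_1}{m},\ P_{2j} \le \frac{r q_2}{|\mathcal{R}_1|},\ C_r^{(j)} \right) \\
&\le \sum_{j \in I_0} \sum_{r=1}^{m} \frac{1}{r} \Pr\left( P_{1j} \le \frac{r q_1}{m},\ C_r^{(j)} \right)
= \sum_{j \in I_0} \sum_{r=1}^{m} \frac{1}{r} \Pr\left( P_{1j} \le \frac{r q_1}{m} \right) \Pr\left( C_r^{(j)} \right) && \text{(A.2)} \\
&\le \frac{q_1}{m} \sum_{j \in I_0} \sum_{r=1}^{m} \Pr\left( C_r^{(j)} \right) = \frac{|I_0|}{m} q_1. && \text{(A.3)}
\end{aligned}$$

The equality in (A.2) follows from the independence of the p-values. The inequality
in (A.3) follows from the fact that for each $j \in I_0$, $\Pr(P_{1j} \le x) \le x$ for all $x \in (0, 1)$.
Finally, the equality in (A.3) follows from the fact that $\cup_{r=1}^{m} C_r^{(j)}$ is the entire sample
space of $(P_1^{(j)}, P_2^{(j)})$, represented as a union of disjoint events. Next, we find an upper
bound for the second term of the sum in (A.1). Let $\mathcal{R}_1(p_1)$ be the set of selected
indices using $P_1 = p_1$. Then $E\left( \sum_{j \in I_{10}} R_j / \max(R, 1) \mid P_1 = p_1 \right)$ equals to:

$$\begin{aligned}
& \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{|\mathcal{R}_1(p_1)|} \frac{1}{r} \Pr\left( P_{1j} \le \frac{r q_1}{m},\ P_{2j} \le \frac{r q_2}{|\mathcal{R}_1(p_1)|},\ C_r^{(j)} \,\Big|\, P_1 = p_1 \right) \\
&\le \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{|\mathcal{R}_1(p_1)|} \frac{1}{r} \Pr\left( P_{2j} \le \frac{r q_2}{|\mathcal{R}_1(p_1)|},\ C_r^{(j)} \,\Big|\, P_1 = p_1 \right) && \text{(A.4)} \\
&= \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{|\mathcal{R}_1(p_1)|} \frac{1}{r} \Pr\left( P_{2j} \le \frac{r q_2}{|\mathcal{R}_1(p_1)|} \,\Big|\, P_1 = p_1 \right) \Pr\left( C_r^{(j)} \mid P_1 = p_1 \right) && \text{(A.5)} \\
&\le \frac{q_2}{|\mathcal{R}_1(p_1)|} \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{|\mathcal{R}_1(p_1)|} \Pr\left( C_r^{(j)} \mid P_1 = p_1 \right) = \frac{q_2}{|\mathcal{R}_1(p_1)|} \left| I_{10} \cap \mathcal{R}_1(p_1) \right|. && \text{(A.6)}
\end{aligned}$$

The equality in (A.5) follows from the conditional independence of $C_r^{(j)}$ and the event
$\{P_{2j} \le r q_2 / |\mathcal{R}_1(p_1)|\}$. The inequality in (A.6) follows from the independence of the
p-values across the studies and the fact that for each $j \in I_{10}$, $\Pr(P_{2j} \le x) \le x$ for all
$x \in (0, 1)$. The equality in (A.6) follows from the fact that $\cup_{r=1}^{|\mathcal{R}_1(p_1)|} C_r^{(j)}$ is a union of
disjoint events, and $\Pr\left( \cup_{r=1}^{|\mathcal{R}_1(p_1)|} C_r^{(j)} \mid P_1 = p_1 \right) = 1$.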

It follows from (A.6) that $E\left( \sum_{j \in I_{10}} R_j / \max(R, 1) \right) \le q_2$. Using this fact and the
bound (A.3) for the first term of (A.1), we obtain:

$$\mathrm{FDR} \le \frac{|I_0|}{m} q_1 + (q - q_1) \le q_1 + (q - q_1) = q.$$


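B Proof for FDR control of the oracle Procedure 3.2



Let us now prove that under the assumption that the p-values are independent,
Procedure 3.2 at levels $(q', 2q')$ controls the FDR at level $|I_{00}| (q')^2 / m + (|I_{01}|/m + 1) q'$.
Returning to the proof of Theorem 3.2, note that (A.1) can be rewritten as follows.

$$\mathrm{FDR} = E\left( \sum_{j \in I_{00}} \frac{R_j}{\max(R, 1)} \right) + E\left( \sum_{j \in I_{01}} \frac{R_j}{\max(R, 1)} \right) + E\left( \sum_{j \in I_{10}} \frac{R_j}{\max(R, 1)} \right). \tag{B.1}$$
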

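We will now give an upper bound for each term of the sum in (B.1). First,

$$\begin{aligned}
E\left( \sum_{j \in I_{00}} \frac{R_j}{\max(R, 1)} \right)
&= \sum_{j \in I_{00}} \sum_{r=1}^{m} \frac{1}{r} \Pr\left( j \in \mathcal{R}_1,\ P_{1j} \le \frac{r q'}{m},\ P_{2j} \le \frac{r q'}{|\mathcal{R}_1|},\ C_r^{(j)} \right) \\
&\le \sum_{j \in I_{00}} \sum_{r=1}^{m} \frac{1}{r} \Pr\left( P_{1j} \le \frac{r q'}{m},\ P_{2j} \le q',\ C_r^{(j)} \right)
\le \frac{(q')^2}{m} \sum_{j \in I_{00}} \sum_{r=1}^{m} \Pr\left( C_r^{(j)} \right) = \frac{|I_{00}|}{m} (q')^2. && \text{(B.2)}
\end{aligned}$$

The second inequality in (B.2) follows from the fact that for each $j \in I_{00}$, $P_{1j}$ and
$P_{2j}$ are independent and $\Pr(P_{ij} \le x) \le x$ for all $x \in (0, 1)$ and $i = 1, 2$. The last equality
in (B.2) follows from the arguments that are given for the equality in (A.3).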



Second, replacing $I_0$ by $I_{01}$ and $|I_0|$ by $|I_{01}|$ in the arguments that led to (A.3), we
obtain:

$$E\left( \sum_{j \in I_{01}} \frac{R_j}{\max(R, 1)} \right) \le \frac{|I_{01}|}{m} q'. \tag{B.3}$$

Finally, using (A.6) in the proof of Theorem 3.2 we obtain that the third term of the
sum in (B.1) is bounded by $q_2 = 2q' - q' = q'$. Using this upper bound, together with
the bounds for the first two terms derived in (B.2) and (B.3), we obtain:

$$\mathrm{FDR} \le \frac{|I_{00}|}{m} (q')^2 + \frac{|I_{01}|}{m} q' + q' = \frac{|I_{00}|}{m} (q')^2 + \left( \frac{|I_{01}|}{m} + 1 \right) q'.$$

It follows that if $|I_{00}|$ and $|I_{01}|$ were known, one could guarantee FDR control at level
q on the family of no replicability null hypotheses by applying Procedure 3.2 at levels
$(q', 2q')$, where $q'$ is the solution to $|I_{00}| (q')^2 / m + (|I_{01}|/m + 1) q' = q$.



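C Proof of Theorem 3.3



We will use the definitions given in Appendix A. In addition, for $s = 1, \dots, m - 1$,
we define the event $D_s^{(j)}$ as follows:

$$D_s^{(j)} = \left\{ (P_1^{(j)}, P_2^{(j)}) : T_{(s)} > s + 1,\ T_{(s+1)} > s + 2,\ \dots,\ T_{(m-1)} > m \right\},$$

and we define $D_m^{(j)}$ to be the entire sample space of $(P_1^{(j)}, P_2^{(j)})$. Note that
$D_s^{(j)} = \cup_{r=1}^{s} C_r^{(j)}$. It is easy to see that $D_s^{(j)}$ is the event in which if $H_{NR,j}$ is rejected by
Procedure 3.2, at most s hypotheses are rejected including $H_{NR,j}$.

Lemma C.1. Under the assumptions of Theorem 3.3,

1. For each $p_2 = (p_{21}, \dots, p_{2m})$, $s \in \{1, \dots, k - 1\}$, and $j \in I_0$, $D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\}$
is an increasing set for $P_1^{(j)}$, i.e. if $(P_1^{(j)}, P_2^{(j)}) \in D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\}$ and
$\tilde{P}_1^{(j)} \ge P_1^{(j)}$, then $(\tilde{P}_1^{(j)}, P_2^{(j)}) \in D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\}$.

2. For each $p_1 = (p_{11}, \dots, p_{1m})$, $s \in \{1, \dots, k - 1\}$, and $j \in I_{10}$, $D_s^{(j)} \cap \{P_1^{(j)} = p_1^{(j)}\}$
is an increasing set for $P_2^{(j)}$, i.e. if $(P_1^{(j)}, P_2^{(j)}) \in D_s^{(j)} \cap \{P_1^{(j)} = p_1^{(j)}\}$ and
$\tilde{P}_2^{(j)} \ge P_2^{(j)}$, then $(P_1^{(j)}, \tilde{P}_2^{(j)}) \in D_s^{(j)} \cap \{P_1^{(j)} = p_1^{(j)}\}$.

3. For each $j \in I_0$ and $p_2^{(j)}$, $\sum_{r=1}^{k} \Pr\left( C_r^{(j)} \mid P_{1j} \le \frac{r q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \le 1$.

4. For each $j \in I_{10}$ and $p_1^{(j)}$, $\sum_{r=1}^{k} \Pr\left( C_r^{(j)} \mid P_{2j} \le \frac{r q_2}{k},\ P_1^{(j)} = p_1^{(j)} \right) \le 1$.

See Section C.1 for a proof.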

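As in the proof of Theorem 3.2 we find an upper bound for each one of the terms of
(A.1) separately. Note that the number of rejected no replicability null hypotheses is
bounded by k, therefore $C_r^{(j)} = \emptyset$ for any $j \in \{1, \dots, m\}$ and $r > k$.

We start with the first term. Let $p_2 = (p_{21}, \dots, p_{2m})$ be arbitrary fixed. From the
first inequality in (A.2) it follows that $E\left( \sum_{j \in I_0} R_j / \max(R, 1) \mid P_2 = p_2 \right)$ is bounded
above by

$$\begin{aligned}
& \sum_{j \in I_0} \sum_{r=1}^{k} \frac{1}{r} \Pr\left( P_{1j} \le \frac{r q_1}{m},\ C_r^{(j)} \,\Big|\, P_2 = p_2 \right) \\
&= \sum_{j \in I_0} \sum_{r=1}^{k} \frac{1}{r} \Pr\left( C_r^{(j)} \,\Big|\, P_{1j} \le \frac{r q_1}{m},\ P_2 = p_2 \right) \Pr\left( P_{1j} \le \frac{r q_1}{m} \,\Big|\, P_2 = p_2 \right) \\
&\le \frac{q_1}{m} \sum_{j \in I_0} \sum_{r=1}^{k} \Pr\left( C_r^{(j)} \,\Big|\, P_{1j} \le \frac{r q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \le \frac{|I_0|}{m} q_1. && \text{(C.1)}
\end{aligned}$$

The first inequality in (C.1) follows from the independence of the p-values across the
studies and the fact that for each $j \in I_0$, $\Pr(P_{1j} \le x) \le x$ for all $x \in (0, 1)$. The
second inequality in (C.1) follows from Lemma C.1, item 3. Taking the expectation
over $P_2$, we obtain $E\left( \sum_{j \in I_0} R_j / \max(R, 1) \right) \le (|I_0|/m)\, q_1$.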

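We will now find an upper bound for the second term in (A.1). Let $p_1 = (p_{11}, \dots, p_{1m})$
be arbitrary fixed. From (A.4), it follows that $E\left( \sum_{j \in I_{10}} R_j / \max(R, 1) \mid P_1 = p_1 \right)$
is bounded above by

$$\begin{aligned}
& \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{k} \frac{1}{r} \Pr\left( P_{2j} \le \frac{r q_2}{k},\ C_r^{(j)} \,\Big|\, P_1 = p_1 \right) \\
&\le \frac{q_2}{k} \sum_{j \in I_{10} \cap \mathcal{R}_1(p_1)} \sum_{r=1}^{k} \Pr\left( C_r^{(j)} \,\Big|\, P_{2j} \le \frac{r q_2}{k},\ P_1^{(j)} = p_1^{(j)} \right) \le \frac{q_2}{k} \left| I_{10} \cap \mathcal{R}_1(p_1) \right|. && \text{(C.2)}
\end{aligned}$$

The first inequality in (C.2) follows from the independence of the p-values across the
studies and the fact that for each $j \in I_{10}$, $\Pr(P_{2j} \le x) \le x$ for all $x \in (0, 1)$. The
second inequality in (C.2) follows from Lemma C.1, item 4. Recalling that $|\mathcal{R}_1(p_1)| = k$,
and taking the expectation over $P_1$, we obtain $E\left( \sum_{j \in I_{10}} R_j / \max(R, 1) \right) \le q_2$.
Recalling that $q_2 = q - q_1$, we obtain that $\mathrm{FDR} \le (|I_0|/m)\, q_1 + q - q_1 \le q$.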



C.1 Proof of Lemma C.1



Proof of item 1. Let $j \in I_0$, and $s \in \{1, \dots, k - 1\}$ be arbitrary fixed. Note
that for all $P_1^{(j)}$, $|\mathcal{R}_1^{(j)}(P_1^{(j)})| = k - 1$. Therefore, for any $(P_1^{(j)}, p_2^{(j)})$, $T_{(i)}$ is finite for
$i \in \{1, \dots, k - 1\}$. In addition, since $s \le k - 1$,

$$D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\} = \left\{ (P_1^{(j)}, p_2^{(j)}) : T_{(s)} > s + 1,\ T_{(s+1)} > s + 2,\ \dots,\ T_{(k-1)} > k \right\}.$$

Let $\tilde{P}_1^{(j)} \ge P_1^{(j)}$. We need to prove that if $(P_1^{(j)}, p_2^{(j)}) \in D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\}$, then
$(\tilde{P}_1^{(j)}, p_2^{(j)}) \in D_s^{(j)} \cap \{P_2^{(j)} = p_2^{(j)}\}$. It is enough to prove that $\tilde{T}_{(i)} \ge T_{(i)}$ for
$i = 1, \dots, k - 1$, where T-values are based on $(P_1^{(j)}, p_2^{(j)})$, and $\tilde{T}$-values are based on
$(\tilde{P}_1^{(j)}, p_2^{(j)})$.

The proof is by contradiction. Assume there exists an $i \in \{1, \dots, k - 1\}$ such that
T(j) < T(j). Let Si G lZi\Pi^) and Sj G lZi\Pi'^) be the corresponding indices, 
i.e. T(j) = Tg^, and T(j) = Tj.. Obviously it holds that Pi^j-. < -Pi,^;. Moreover, 
since the selection rule used at step 1 of Procedure 13.21 is a A;- valid selection rule, 
there exists a set of indices {si, . . . C 7li\Pi^), such that the corresponding 

hypotheses belong to i — 1 different groups gsi, ■ ■ ■ that are different from gg., 

and Pij^ < Pi^s2 < ■ ■ ■ < < PiJr For t = 1, . . . , z, Pi^^^ > Pi^st, therefore for 

each t = 1, . . . ,i, Pi^^t < Pi,si- Since Si G 7li\Pi''), Pi^si is the smallest coordinate 
of 

p^^'^ in group (7^-, therefore we obtain that ^ {S'sd ■ ■ ■ ^gsi}- Therefore, there is 
exactly one selected coordinate of P^''^ in each one of the groups gs^, . . . , gsi, and each 
one of these selected coordinates is strictly smaller than Thus we obtain that 

there are at least i T- values which are strictly smaller than T^., contradicting the fact 
that Ts^ = T(^iy 

Proof of item 2. Let $j \in I_{10}$, $p_1^{(j)}$ and $s \le k - 1$ be arbitrary fixed. The result
follows from the fact that $\{T_{(i)},\ i = 1, \dots, k - 1\} = \{T_i,\ i \in \mathcal{R}_1^{(j)}(p_1^{(j)})\}$, and for each
$i \in \mathcal{R}_1^{(j)}(p_1^{(j)})$, $T_i$ is increasing in $P_{2i}$ for fixed $P_1^{(j)} = p_1^{(j)}$.



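Proof of item 3. We use the technique developed in Benjamini and Yekutieli (2001)
to prove item 3 using item 1. Let $j \in I_0$ and $p_2^{(j)}$ be arbitrary fixed. Using item 1,
the PRDS property of the p-values from the primary study and the independence of
the p-values across the studies, we obtain for each $s \in \{1, \dots, k - 1\}$:

$$\Pr\left( D_s^{(j)} \,\Big|\, P_{1j} \le \frac{s q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \le \Pr\left( D_s^{(j)} \,\Big|\, P_{1j} \le \frac{(s + 1) q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right).$$

Using the fact that $D_s^{(j)} \cup C_{s+1}^{(j)} = D_{s+1}^{(j)}$, where $D_s^{(j)}$ and $C_{s+1}^{(j)}$ are disjoint events, we
obtain for each $s \in \{1, \dots, k - 1\}$:

$$\begin{aligned}
& \Pr\left( D_s^{(j)} \,\Big|\, P_{1j} \le \frac{s q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) + \Pr\left( C_{s+1}^{(j)} \,\Big|\, P_{1j} \le \frac{(s + 1) q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \\
&\le \Pr\left( D_s^{(j)} \,\Big|\, P_{1j} \le \frac{(s + 1) q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) + \Pr\left( C_{s+1}^{(j)} \,\Big|\, P_{1j} \le \frac{(s + 1) q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \\
&= \Pr\left( D_{s+1}^{(j)} \,\Big|\, P_{1j} \le \frac{(s + 1) q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right).
\end{aligned}$$

Since $D_1^{(j)} = C_1^{(j)}$, repeatedly using this inequality we obtain:

$$\sum_{r=1}^{k} \Pr\left( C_r^{(j)} \,\Big|\, P_{1j} \le \frac{r q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \le \Pr\left( D_k^{(j)} \,\Big|\, P_{1j} \le \frac{k q_1}{m},\ P_2^{(j)} = p_2^{(j)} \right) \le 1.$$
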

Proof of item 4. We use the technique developed in Benjamini and Yekutieli (2001)
to prove item 4 using item 2. Let $j \in I_{10}$ and $p_1^{(j)}$ be arbitrary fixed. Using item 2,
the PRDS property of the p-values from the follow-up study and the independence
of the p-values across the studies, we obtain for each $s \in \{1, \dots, k - 1\}$:

$$\Pr\left( D_s^{(j)} \,\Big|\, P_{2j} \le \frac{s q_2}{k},\ P_1^{(j)} = p_1^{(j)} \right) \le \Pr\left( D_s^{(j)} \,\Big|\, P_{2j} \le \frac{(s + 1) q_2}{k},\ P_1^{(j)} = p_1^{(j)} \right).$$

Now the proof is analogous to the proof of item 3 and therefore is omitted.



D Proof of Theorem 4.1



Let $V_{12} = \sum_{j \in I_{00} \cup I_{01} \cup I_{10}} I[j \in \mathcal{R}_{12, w_1 q}]$ and $R_{12} = |\mathcal{R}_{12, w_1 q}|$ denote the number of
erroneously rejected and the total number of rejected no replicability null hypotheses
by Procedure 3.2 at level $w_1 q$ with study one as the primary study and study
two as the follow-up study. Similarly, let $V_{21} = \sum_{j \in I_{00} \cup I_{01} \cup I_{10}} I[j \in \mathcal{R}_{21, (1 - w_1) q}]$ and
$R_{21} = |\mathcal{R}_{21, (1 - w_1) q}|$ denote the number of erroneously rejected and the total number
of rejected no replicability null hypotheses by Procedure 3.2 at level $(1 - w_1) q$
with study two as the primary study and study one as the follow-up study. Define
$\mathcal{R}_s = \mathcal{R}_{12, w_1 q} \cup \mathcal{R}_{21, (1 - w_1) q}$, the indices of the no replicability null hypotheses rejected
by Procedure 4.1. Let $V_s = \sum_{j \in I_{00} \cup I_{01} \cup I_{10}} I[j \in \mathcal{R}_s]$ and $R_s = |\mathcal{R}_s|$ denote the number of
erroneously rejected and the total number of rejected no replicability null hypotheses
by Procedure 4.1.

Note that $V_s \le V_{12} + V_{21}$. Therefore,

$$\mathrm{FDR} = E\left( \frac{V_s}{\max(R_s, 1)} \right) \le E\left( \frac{V_{12}}{\max(R_s, 1)} \right) + E\left( \frac{V_{21}}{\max(R_s, 1)} \right). \tag{D.1}$$

In addition, note that $\max(R_s, 1) \ge \max(R_{12}, 1)$ and $\max(R_s, 1) \ge \max(R_{21}, 1)$. Using
these facts and (D.1) we obtain

$$\mathrm{FDR} = E\left( \frac{V_s}{\max(R_s, 1)} \right) \le E\left( \frac{V_{12}}{\max(R_{12}, 1)} \right) + E\left( \frac{V_{21}}{\max(R_{21}, 1)} \right) \le w_1 q + (1 - w_1) q = q,$$

where the last inequality follows from Theorems 3.2 and 3.3.



